ABSTRACT



This invention relates to customized electronic identification 
of desirable objects, such as news articles, in an electronic 
media environment, and in particular to a system that 
automatically constructs both a "target profile" for each 
target object in the electronic media based, for example, on 
the frequency with which each word appears in an article 
relative to its overall frequency of use in all articles, as well 
as a "target profile interest summary" for each user, which 
target profile interest summary describes the user's interest 
level in various types of target objects. The system then 
evaluates the target profiles against the users' target profile 
interest summaries to generate a user-customized rank 
ordered listing of target objects most likely to be of interest 
to each user so that the user can select from among these 
potentially relevant target objects, which were automatically 
selected by this system from the plethora of target objects 
that are profiled on the electronic media. Users' target profile
interest summaries can be used to efficiently organize the
distribution of information in a large scale system consisting 
of many users interconnected by means of a communication 
network. Additionally, a cryptographically-based pseud- 
onym proxy server is provided to ensure the privacy of a 
user's target profile interest summary, by giving the user
control over the ability of third parties to access this sum- 
mary and to identify or contact the user. 

15 Claims, 13 Drawing Sheets 
number determined by calculating the statistical variance of
the profiles of all target objects in a cluster, is termed a
"cluster variance,"

(k.) a real number determined by calculating the maximum
distance between the profiles of any two target objects in a
cluster, is termed a "cluster diameter."

The system for electronic identification of desirable
objects of the present invention automatically constructs
both a target profile for each target object in the electronic
media based, for example, on the frequency with which each
word appears in an article relative to its overall frequency of
use in all articles, as well as a "target profile interest
summary" for each user, which target profile interest summary
describes the user's interest level in various types of
target objects. The system then evaluates the target profiles
against the users' target profile interest summaries to generate
a user-customized rank ordered listing of target objects
most likely to be of interest to each user so that the user can
select from among these potentially relevant target objects,
which were automatically selected by this system from the
plethora of target objects available on the electronic media.

Because people have multiple interests, a target profile
interest summary for a single user must represent multiple
areas of interest, for example, by consisting of a set of
individual search profiles, each of which identifies one of the
user's areas of interest. Each user is presented with those
target objects whose profiles most closely match the user's
interests as described by the user's target profile interest
summary. Users' target profile interest summaries are automatically
updated on a continuing basis to reflect each user's
changing interests. In addition, target objects can be grouped
into clusters based on their similarity to each other, for
example, based on similarity of their topics in the case where
the target objects are published articles, and menus automatically
generated for each cluster of target objects to allow
users to navigate throughout the clusters and manually
locate target objects of interest. For reasons of confidentiality
and privacy, a particular user may not wish to make
public all of the interests recorded in the user's target profile
interest summary, particularly when these interests are determined
by the user's purchasing patterns. The user may
desire that all or part of the target profile interest summary
be kept confidential, such as information relating to the
user's political, religious, financial or purchasing behavior;
indeed, confidentiality with respect to purchasing behavior
is the user's legal right in many states. It is therefore
necessary that data in a user's target profile interest summary
be protected from unwanted disclosure except with the
user's agreement. At the same time, the user's target profile
interest summaries must be accessible to the relevant servers
that perform the matching of target objects to the users, if the
benefit of this matching is desired by both providers and
consumers of the target objects. The disclosed system provides
a solution to the privacy problem by using a proxy
server which acts as an intermediary between the information
provider and the user. The proxy server dissociates the
user's true identity from the pseudonym by the use of
cryptographic techniques. The proxy server also permits
users to control access to their target profile interest summaries
and/or user profiles, including provision of this
information to marketers and advertisers if they so desire,
possibly in exchange for cash or other considerations. Marketers
may purchase these profiles in order to target advertisements
to particular users, or they may purchase partial
user profiles, which do not include enough information to
identify the individual users in question, in order to carry out
standard kinds of demographic analysis and market research
on the resulting database of partial user profiles. Pseudonymous
control of an information server suggests how a
special discount can be issued to a user's pseudonym and
that such a digital credential is provided to the user as a
result of his/her user profile making him/her eligible. The
user may thus present this type of credential to the appropriate
vendor to take advantage of the discount. This technique
can be extended also to smart cards wherein the digital
credential providing the discount is downloaded from the
client to the smart card and upon presentation, the vendor
may if desired, delete the credential upon redemption by the
user. These discount credentials may similarly include any
of the discount types (customized promotions) herein disclosed
wherein each purchase may be identified (characterized)
and credentialized by the vendor onto the user's smart card
and/or the vendor's system.

In the preferred embodiment of the invention, the system
for customized electronic identification of desirable objects
uses a fundamental methodology for accurately and efficiently
matching users and target objects by automatically
calculating, using and updating profile information that
describes both the users' interests and the target objects'
characteristics. The target objects may be published articles,
purchasable items, or even other people, and their properties
are stored, and/or represented and/or denoted on the electronic
media as (digital) data. Examples of target objects can
include, but are not limited to: a newspaper story of potential
interest, a movie to watch, an item to buy, e-mail to receive,
or another person to correspond with. In one suggested
application, the user is a sender of email (which may have
originated from the user or from another external source
such as from outside of a large organization) and the target
objects are users who might be considered most appropriate
based upon previous messages which they have received,
read and responded to. Accordingly, like other target objects,
users (or user pseudonyms) in accordance with their user
profiles (or portions of which they have disclosed) may be
organized and browsed within an automatically generated
menu tree, which is below described in detail. In all these
cases, the information delivery process in the preferred
embodiment is based on determining the similarity between
a profile for the target object and the profiles of target objects
for which the user (or a similar user) has provided positive
feedback in the past. The individual data that describe a
target object and constitute the target object's profile are
herein termed "attributes" of the target object. Attributes
may include, but are not limited to, the following: (1) long
pieces of text (a newspaper story, a movie review, a product
description or an advertisement), (2) short pieces of text
(name of a movie's director, name of town from which an
advertisement was placed, name of the language in which an
article was written), (3) numeric measurements (price of a
product, rating given to a movie, reading level of a book), (4)
associations with other types of objects (list of actors in a
movie, list of persons who have read a document). Any of
these attributes, but especially the numeric ones, may correlate
with the quality of the target object, such as measures
of its popularity (how often it is accessed) or of user
satisfaction (number of complaints received).

The preferred embodiment of the system for customized
electronic identification of desirable objects operates in an
electronic media environment for accessing these target
objects, which may be news, electronic mail, other published
documents, or product descriptions. The system in its
broadest construction comprises three conceptual modules,
which may be separate entities distributed across many
implementing systems, or combined into a lesser subset of
physical entities. The specific embodiment of this system
disclosed herein illustrates the use of a first module which
automatically constructs a "target profile" for each target
object in the electronic media based on various descriptive
attributes of the target object. A second module uses interest
feedback from users to construct a "target profile interest
summary" for each user, for example in the form of a "search
profile set" consisting of a plurality of search profiles, each
of which corresponds to a single topic of high interest for the
user. The system further includes a profile processing module
which estimates each user's interest in various target
objects by reference to the users' target profile interest
summaries, for example by comparing the target profiles of
these target objects against the search profiles in users'
search profile sets, and generates for each user a customized
rank-ordered listing of target objects most likely to be of
interest to that user. Each user's target profile interest
summary is automatically updated on a continuing basis to
reflect the user's changing interests.
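
As an illustration only, the three modules described above can be
sketched in Python. The sketch assumes that the target objects are
short articles, that a target profile is a table of word scores (term
frequency multiplied by the negated logarithm of the fraction of
articles containing the word), that each search profile is the average
of the target profiles a user liked within one topic, and that interest
is estimated by cosine similarity. The function names, the sample
articles, and the choice of cosine similarity are assumptions made for
this sketch and are not taken from the disclosure.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_target_profiles(articles):
    """Module 1 (sketch): score each word of each article by its frequency in
    that article times the negated log of its frequency across all articles."""
    word_lists = {name: tokenize(text) for name, text in articles.items()}
    n = len(articles)
    doc_count = Counter()
    for words in word_lists.values():
        doc_count.update(set(words))
    profiles = {}
    for name, words in word_lists.items():
        tf = Counter(words)
        profiles[name] = {
            w: (c / len(words)) * -math.log(doc_count[w] / n)
            for w, c in tf.items()
        }
    return profiles


def build_search_profile_set(profiles, liked_groups):
    """Module 2 (sketch): one search profile per topic of interest, taken here
    as the average of the target profiles the user liked in that topic."""
    search_profiles = []
    for group in liked_groups:
        total = defaultdict(float)
        for name in group:
            for w, v in profiles[name].items():
                total[w] += v / len(group)
        search_profiles.append(dict(total))
    return search_profiles


def cosine(p, q):
    """Cosine similarity between two sparse word-score tables."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def rank_targets(profiles, search_profiles, exclude=()):
    """Module 3 (sketch): a target's estimated interest is its best match
    against any search profile; return names in rank order."""
    scored = [
        (max((cosine(p, s) for s in search_profiles), default=0.0), name)
        for name, p in profiles.items()
        if name not in exclude
    ]
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical sample data: the user liked one finance and one sports article.
articles = {
    "a1": "stock market rally lifts bank shares",
    "a2": "bank shares fall as market slides",
    "a3": "team wins cup final after late goal",
    "a4": "striker scores late goal in cup match",
    "a5": "new recipe for apple pie",
}
profiles = build_target_profiles(articles)
search_profile_set = build_search_profile_set(profiles, [["a1"], ["a3"]])
# "a5" shares no words with either liked article, so it is ranked last.
print(rank_targets(profiles, search_profile_set, exclude={"a1", "a3"}))
```

Keeping one search profile per topic, rather than a single averaged
profile, means that an article matching only the finance interest is
not penalized for failing to match the sports interest.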

Target objects may be of various sorts, and it is sometimes
advantageous to use a single system that delivers and/or
clusters target objects of several distinct sorts at once, in a
unified framework. For example, users who exhibit a strong
interest in certain novels may also show an interest in certain
movies, presumably of a similar nature. A system in which
some target objects are novels and other target objects are
movies can discover such a correlation and exploit it in order
to group particular novels with particular movies, e.g., for
clustering purposes, or to recommend the movies to a user
who has demonstrated interest in the novels. Similarly, if
users who exhibit an interest in certain World Wide Web
sites also exhibit an interest in certain products, the system
can match the products with the sites and thereby recommend
to the marketers of those products that they place
advertisements at those sites, e.g., in the form of hypertext
links to their own sites. The presently described system
explains the techniques for target advertising (on a user by
user basis) through both links from advertisements on a web
page which tends to be visited by the most likely buyers of
that particular product or service, and routing advertisements
to such users via email. (This assumes that, because
user visitorship is measured at the level of the web page,
certain pages within the web site may be more appropriate
for certain advertisements due to the slight differences in their
visitorship.) Text chat (or acoustic voice chat) using a text to
speech conversion module may be used in conjunction with
real time profiling of the real time user dialogues occurring
within that chat session. Advertisements which are relevant
to the nature of the content being discussed at present may provide
temporary links to the appropriate product such that when
the nature of the content changes the advertisements change
(may disappear) accordingly.

The ability to measure the similarity of profiles describing
target objects and a user's interests can be applied in two
basic ways: filtering and browsing. Filtering is useful when
large numbers of target objects are described in the electronic
media space. These target objects can for example be
articles that are received or potentially received by a user,
who only has time to read a small fraction of them. For
example, one might potentially receive all items on the AP
news wire service, all items posted to a number of news
groups, all advertisements in a set of newspapers, or all
unsolicited electronic mail, but few people have the time or
inclination to read so many articles. A filtering system in the
system for customized electronic identification of desirable
objects automatically selects a set of articles that the user is
likely to wish to read. The accuracy of this filtering system
improves over time by noting which articles the user reads
and by generating a measurement of the depth to which the
user reads each article. This information is then used to 
update the user's target profile interest summary. Browsing 
provides an alternate method of selecting a small subset of 
a large number of target objects, such as articles. Articles are 
organized so that users can actively navigate among groups 
of articles by moving from one group to a larger, more 
general group, to a smaller, more specific group, or to a 
closely related group. Each individual article forms a one- 
member group of its own, so that the user can navigate to 
and from individual articles as well as larger groups. The
methods used by the system for customized electronic 
identification of desirable objects allow articles to be 
grouped into clusters and the clusters to be grouped and 
merged into larger and larger clusters. These hierarchies of 
clusters then form the basis for menuing and navigational 
systems to allow the rapid searching of large numbers of 
articles. This same clustering technique is applicable to any 
type of target objects that can be profiled on the electronic 
media such as product selections within a menu or through- 
out the World Wide Web. 
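
The grouping of clusters into larger and larger clusters can be
illustrated with a small Python sketch. It is not the clustering
procedure of the disclosure: it assumes that each target profile is a
short numeric vector, that distance is Euclidean, and that the two
clusters merged at each step are those whose union has the smallest
cluster diameter (the maximum distance between any two member
profiles). The function names and sample profiles are hypothetical.

```python
import math
from itertools import combinations


def diameter(members, profiles):
    """Cluster diameter: the maximum distance between any two member profiles."""
    return max(
        (math.dist(profiles[a], profiles[b]) for a, b in combinations(members, 2)),
        default=0.0,
    )


def leaves(node):
    """List the target objects under a tree node (a name or a pair of nodes)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def build_cluster_tree(profiles):
    """Merge clusters pairwise, always choosing the merge with the smallest
    resulting diameter, until a single tree remains."""
    clusters = sorted(profiles)  # every object starts as a one-member cluster
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: diameter(
                leaves(clusters[ij[0]]) + leaves(clusters[ij[1]]), profiles
            ),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


def print_menu(node, depth=0):
    """Print the tree as an indented menu; each inner node is a submenu."""
    if isinstance(node, str):
        print("  " * depth + node)
    else:
        print("  " * depth + f"[group of {len(leaves(node))}]")
        for child in node:
            print_menu(child, depth + 1)


# Hypothetical two-dimensional profiles: (finance weight, sports weight).
profiles = {
    "budget": (0.9, 0.1),
    "taxes": (0.8, 0.2),
    "football": (0.1, 0.9),
    "tennis": (0.2, 0.8),
}
print_menu(build_cluster_tree(profiles))
```

Each inner node of the resulting tree corresponds to a menu entry, and
moving up or down the tree corresponds to moving to a more general or
a more specific group of articles.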

There are a number of variations on the theme of developing
and using profiles for article retrieval. Variations of
this basic system are disclosed and comprise a system to
filter electronic mail, an extension for retrieval of target
objects such as purchasable items which may have more
complex descriptions, a system to automatically build and
alter menuing systems for browsing and searching through
large numbers of target objects, and a system to construct
virtual communities of people with common interests. These
intelligent filters and browsers are necessary to provide a
truly passive, intelligent system interface. A user interface
that permits intuitive browsing and filtering represents for
the first time an intelligent system for determining the
affinities between users and target objects. The detailed,
comprehensive target profiles and user-specific target profile
interest summaries enable the system to provide responsive
routing of specific queries for user information access. The
information maps so produced and the application of users'
target profile interest summaries to predict the information
consumption patterns of a user allow for pre-caching of
data at locations on the data communication network and at
times that minimize the traffic flow in the communication
network to thereby efficiently provide the desired information
to the user and/or conserve valuable storage space by
only storing those target objects (or segments thereof) which
are relevant to the user's interests.

BRIEF DESCRIPTION OF THE DRAWING 

FIG. 1 illustrates in block diagram form a typical archi- 
tecture of an electronic media system in which the system 
for customized electronic identification of desirable objects 
of the present invention can be implemented as part of a user 
server system; 

FIG. 2 illustrates in block diagram form one embodiment 
of the system for customized electronic identification of 
desirable objects; 

FIGS. 3 and 4 illustrate typical network trees; 

FIG. 5 illustrates in flow diagram form a method for 
automatically generating article profiles and an associated 
hierarchical menu system; 

FIGS. 6-9 illustrate examples of the menu generating
process;

FIG. 10 illustrates in flow diagram form the operational 
steps taken by the system for customized electronic identi- 
fication of desirable objects to screen articles for a user; 
FIG. 11 illustrates a hierarchical cluster tree example;

FIG. 12 illustrates in flow diagram form the process for 
determination of likelihood of interest by a specific user in 
a selected target object; 

FIGS. 13A-B illustrate in flow diagram form the automatic
clustering process;

FIG. 14 illustrates in flow diagram form the use of the 
pseudonymous server; 

FIG. 15 illustrates in flow diagram form the use of the 
system for accessing information in response to a user 
query; and 

FIG. 16 illustrates in flow diagram form the use of the 
system for accessing information in response to a user query 
when the system is a distributed network implementation.

DETAILED DESCRIPTION 

MEASURING SIMILARITY 

This section describes a general procedure for automati- 
cally measuring the similarity between two target objects, or, 
more precisely, between target profiles that are automatically 
generated for each of the two target objects. This similarity 
determination process is applicable to target objects in a 
wide variety of contexts. Target objects being compared can
be, as an example but not limited to: textual documents,
human beings, movies, or mutual funds. It is assumed that
the target profiles which describe the target objects are 
stored at one or more locations in a data communication 
network on data storage media associated with a computer 
system. 

The computed similarity measurements serve as input to 
additional processes, which function to enable human users 
to locate desired target objects using a large computer 
system. These additional processes estimate a human user's
interest in various target objects, or else cluster a plurality of
target objects into logically coherent groups. The methods
used by these additional processes might in principle be
implemented on either a single computer or on a computer
network. Jointly or separately, they form the underpinning
for various sorts of database systems and information
retrieval systems.
Target Objects and Attributes 

In classical Information Retrieval (IR) technology, the 
user is a literate human and the target objects in question are 
textual documents stored on data storage devices interconnected
to the user via a computer network. That is, the target
objects consist entirely of text, and so are digitally stored on 
the data storage devices within the computer network. 
However, there are other target object domains that present 
related retrieval problems that are not capable of being 
solved by present information retrieval technology which 
are applicable to targeting of articles and advertisements to 
readers of an on-line newspaper: 

(a.) the user is a film buff and the target objects are movies 
available on videotape. 

(b.) the user is a consumer and the target objects are used 
cars being sold. 

(c.) the user is a consumer and the target objects are 
products being sold through promotional deals. 

(d.) the user is an investor and the target objects are 
publicly traded stocks, mutual funds and/or real estate 
properties. 

(e.) the user is a student and the target objects are classes 
being offered.

(f.) the user is an activist and the target objects are 
Congressional bills of potential concern. 
(g.) the user is about to send an e-mail message and the
target objects are potential recipients who are interested 
in the content of that message. 

(h.) the user is a corporate receptionist receiving incoming 
e-mail, voice mail or live telephone calls and the target 
objects are the employees which are the most qualified 
to handle those incoming media. 

(i.) the user is a net-surfer and the target objects are links 
to pages, servers, or newsgroups available on the World 
Wide Web which are linked from pages and articles in 
the on-line newspaper. 

(j.) the user is a philanthropist and the target objects are 
charities. 

(k.) the user is ill and the target objects are ads for medical 
specialists. 

(l.) the user is an employee and the target objects are
classifieds for potential employers. 

(m.) the user is an employer and the target objects are 
classifieds for potential employees. 

(n.) the user is a lonely heart and the target objects are 
classifieds for potential conversation partners. 

(o.) the user is in search of an expert and the target objects 
are users, with known retrieval habits, of a document
retrieval system. 

(p.) the user is in need of insurance and the target objects 
are classifieds for insurance policy offers.

In all these cases, the user wishes to locate some small 
subset of the target objects — such as the target objects that 
the user most desires to rent, buy, investigate, meet, read, 
give mammograms to, insure, and so forth. The task is to 
help the user identify the most interesting target objects, 
where the user's interest in a target object is defined to be a 
numerical measurement of the user's relative desire to locate 
that object rather than others. 

The generality of this problem motivates a general
approach to solving the information retrieval problems noted 
above. It is assumed that many target objects are known to 
the system for customized electronic identification of desir- 
able objects, and that specifically, the system stores (or has 
the ability to reconstruct) several pieces of information 
about each target object. These pieces of information are 
termed "attributes"; collectively, they are said to form a
profile of the target object, or a "target profile." For example,
where the system for customized electronic identification of
desirable objects is activated to identify selections of interest
in a particular category of on-line products for review or
purchase by the user, it can be appreciated that there are
certain unique sets of attributes which are pertinent to the
particular product category of choice. For the application as
part of a movie critic column (where the system identifies
novel titles and reviews which are most interesting to the
user) the system is likely to be concerned with the values of
attributes such as these:

(a.) title of movie,

(b.) name of director,

(c.) Motion Picture Association of America (MPAA)
child-appropriateness rating (0=G, 1=PG, . . . ),

(d.) date of release,

(e.) number of stars granted by a particular critic,

(f.) number of stars granted by a second critic,

(g.) number of stars granted by a third critic,

(h.) full text of review by the third critic,

(i.) list of customers who have previously rented this
movie,

(j.) list of actors.

Each movie has a different set of values for these
attributes. This example conveniently illustrates three kinds
of attributes. Attributes c-g are numeric attributes, of the
sort that might be found in a database record. It is evident
that they can be used to help the user identify target objects
(movies) of interest. For example, the user might previously
have rented many Parental Guidance (PG) films, and many
films made in the 1970's. This generalization is useful: new
films with values for one or both attributes that are numerically
similar to these (such as MPAA rating of 1, release date
of 1975) are judged similar to the films the user already
likes, and therefore of probable interest. Attributes a-b and
h are textual attributes. They too are important for helping
the user locate desired films. For example, perhaps the user
has shown a past interest in films whose review text
(attribute h) contains words like "chase," "explosion,"
"explosions," "hero," "gripping," and "superb." This generalization
is again useful in identifying new films of interest.
Attribute i is an associative attribute. It records associations
between the target objects in this domain, namely
movies, and ancillary target objects of an entirely different
sort, namely humans. A good indication that the user wants
to rent a particular movie is that the user has previously
rented other movies with similar attribute values, and this
holds for attribute i just as it does for attributes a-h. For
example, if the user has often liked movies that certain other
customers have rented, then the user may like
other such movies, which have similar values for attribute i.
Attribute j is another example of an associative attribute,
recording associations between target objects and actors.
Notice that any of these attributes can be made subject to
authentication when the profile is constructed, through the
use of digital signatures; for example, the target object could
be accompanied by a digitally signed note from the MPAA,
which note names the target object and specifies its authentic
value for attribute c.
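
The three kinds of attributes illustrated by the movie example can be
held in one record and compared kind by kind. The following Python
sketch is illustrative only: the attribute names, the per-attribute
scales, the use of word and set overlap (Jaccard) for textual and
associative attributes, and the plain averaging of the per-attribute
similarities are all assumptions of the sketch, not measures defined in
this section.

```python
from dataclasses import dataclass, field


@dataclass
class TargetProfile:
    """A target profile holding the three kinds of attributes."""
    numeric: dict = field(default_factory=dict)      # e.g. rating, release year
    textual: dict = field(default_factory=dict)      # e.g. review text
    associative: dict = field(default_factory=dict)  # e.g. set of renters


def numeric_similarity(x, y, scale):
    """1.0 for equal values, falling to 0.0 once they differ by `scale` or more."""
    return max(0.0, 1.0 - abs(x - y) / scale)


def jaccard(a, b):
    """Overlap of two collections treated as sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def profile_similarity(p, q, scales):
    """Average the per-attribute similarities over attributes both profiles share."""
    parts = []
    for name in p.numeric.keys() & q.numeric.keys():
        parts.append(numeric_similarity(p.numeric[name], q.numeric[name], scales[name]))
    for name in p.textual.keys() & q.textual.keys():
        parts.append(jaccard(p.textual[name].lower().split(),
                             q.textual[name].lower().split()))
    for name in p.associative.keys() & q.associative.keys():
        parts.append(jaccard(p.associative[name], q.associative[name]))
    return sum(parts) / len(parts) if parts else 0.0


# Hypothetical movies: one the user liked and two candidates.
liked = TargetProfile(
    numeric={"mpaa_rating": 1, "release_year": 1975},
    textual={"review": "gripping chase and explosions"},
    associative={"renters": {"c1", "c2"}},
)
candidate_a = TargetProfile(
    numeric={"mpaa_rating": 1, "release_year": 1978},
    textual={"review": "superb chase with explosions"},
    associative={"renters": {"c2", "c3"}},
)
candidate_b = TargetProfile(
    numeric={"mpaa_rating": 3, "release_year": 1999},
    textual={"review": "quiet period drama"},
    associative={"renters": {"c4"}},
)
scales = {"mpaa_rating": 4, "release_year": 30}  # assumed ranges per attribute
# candidate_a resembles the liked film on every attribute, so it scores higher.
print(profile_similarity(liked, candidate_a, scales))
print(profile_similarity(liked, candidate_b, scales))
```

Treating the associative attribute as a set makes the renter list
comparable in the same way as any other attribute: two films rented by
many of the same customers come out as similar on that attribute.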

These three kinds of attributes are common: numeric,
textual, and associative. In the classical information retrieval
problem, where the target objects are documents (or more
generally, coherent document sections extracted by a text
segmentation method), the system might only consider a
single, textual attribute when measuring similarity: the full
text of the target object. However, a more sophisticated
system would consider a longer target profile, including
numeric and associative attributes:

(a.) full text of document (textual),

(b.) title (textual),

(c.) author (textual),

(d.) language in which document is written (textual),

(e.) date of creation (numeric),

(f.) date of last update (numeric),

(g.) length in words (numeric),

(h.) reading level (numeric),

(i.) quality of document as rated by a third party editorial
agency (numeric),

(j.) list of other readers who have retrieved this document
(associative).

As another domain example, consider a domain where the 
user is an advertiser and the target objects are potential 
customers. The system might store the following attributes
for each target object (potential customer): 

(a.) first two digits of zip code (textual), 

(b.) first three digits of zip code (textual), 

(c.) entire five-digit zip code (textual), 

(d.) distance of residence from advertiser's nearest physi- 
cal storefront (numeric), 

(e.) annual family income (numeric), 

(f.) number of children (numeric), 

(g.) list of previous items purchased by this potential 
customer (associative), 

(h.) list of filenames stored on this potential customer's 
client computer (associative), 

(i.) list of movies rented by this potential customer 
(associative), 

(j.) list of investments in this potential customer's invest- 
ment portfolio (associative), 

(k.) list of documents retrieved by this potential customer 
(associative), 

(l.) written response to Rorschach inkblot test (textual),

(m.) multiple-choice responses by this customer to 20 
self-image questions (20 textual attributes). 

As always, the notion is that similar consumers buy 
similar products. It should be noted that diverse sorts of 
information are being used here to characterize consumers, 
from their consumption patterns to their literary tastes and
psychological peculiarities, and that this fact illustrates both 
the flexibility and power of the system for customized 
electronic identification of desirable objects of the present 
invention. Diverse sorts of information can be used as 
attributes in other domains as well (as when physical, 
economic, psychological and interest-related questions are 
used to profile the applicants to a dating service, which is 
indeed a possible domain for the present system), and the 
advertiser domain is simply an example. 

As a final domain example, consider a domain where the 
user is a stock market investor and the target objects are
publicly traded corporations. A great many attributes might 
be used to characterize each corporation, including but not 
limited to the following: 

(a.) type of business (textual), 

(b.) corporate mission statement (textual), 

(c.) number of employees during each of the last 10 years 
(ten separate numeric attributes), 

(d.) percentage growth in number of employees during 
each of the last 10 years, 

(e.) dividend payment issued in each of the last 40 
quarters, as a percentage of current share price, 

(f.) percentage appreciation of stock value during each of 
the last 40 quarters, list of shareholders (associative), 

(g.) composite text of recent articles about the corporation 
in the financial press (textual). 
For example, a customized financial news column may be 
presented to the user in the form of articles which are of
interest to the user. In addition, those stocks which are most 
interesting to the user may be presented as well. 

It is worth noting some additional attributes that are of 
interest in some domains. In the case of documents and 
certain other domains, it is useful to know the source of each 
target object (for example, refereed journal article vs. UPI 
newswire article vs. Usenet newsgroup posting vs. question-answer
pair from a question-and-answer list vs. tabloid
newspaper article vs. . . . ); the source may be represented 
as a single-term textual attribute. Important associative 
attributes for a hypertext document are the list of documents 
that it links to, and the list of documents that link to it. 
Documents with similar citations are similar with respect to 
the former attribute, and documents that are cited in the 
same places are similar with respect to the latter. A 
convention may optionally be adopted that any document also links 
to itself. Especially in systems where users can choose 
whether or not to retrieve a target object, a target object's 
popularity (or circulation) can be usefully measured as a 
numeric attribute specifying the number of users who have 
retrieved that object. Related measurable numeric attributes 
that also indicate a kind of popularity include the number of 
replies to a target object, in the domain where target objects 
are messages posted to an electronic community such as a 
computer bulletin board or newsgroup, and the number of 
links leading to a target object, in the domain where target 
objects are interlinked hypertext documents on the World 
Wide Web or a similar system. A target object may also 
receive explicit numeric evaluations (another kind of 
numeric attribute) from various groups, such as the Motion 
Picture Association of America (MPAA), as above, which 
rates movies' appropriateness for children, or the American 
Medical Association, which might rate the accuracy and 
novelty of medical research papers, or a random survey 
sample of users (chosen from all users or a selected set of 
experts), who could be asked to rate nearly anything. Certain 
other types of evaluation, which also yield numeric 
attributes, may be carried out mechanically. For example, 
the difficulty of reading a text can be assessed by standard 
procedures that count word and sentence lengths, while the 
vulgarity of a text could be defined as (say) the number of 
vulgar words it contains, and the expertise of a text could be 
crudely assessed by counting the number of similar texts its 
author had previously retrieved and read using the invention, 
perhaps confining this count to texts that have high approval 
ratings from critics. Finally, it is possible to synthesize 
certain textual attributes mechanically, for example to 
reconstruct the script of a movie by applying speech recognition 
techniques to its soundtrack or by applying optical character 
recognition techniques to its closed-caption subtitles. 

Decomposing Complex Attributes 

Although textual and associative attributes are large and 
complex pieces of data, for information retrieval purposes 
they can be decomposed into smaller, simpler numeric 
attributes. This means that any set of attributes can be 
replaced by a (usually larger) set of numeric attributes, and 
hence that any profile can be represented as a vector of 
numbers denoting the values of these numeric attributes. In 
particular, a textual attribute, such as the full text of a movie 
review, can be replaced by a collection of numeric attributes 
that represent scores to denote the presence and significance 
of the words "aardvark," "aback," "abacus," and so on 
through "zymurgy" in that text. The score of a word in a text 
may be defined in numerous ways. The simplest definition is 
that the score is the rate of the word in the text, which is 
computed by computing the number of times the word 
occurs in the text, and dividing this number by the total 
number of words in the text. This sort of score is often called 
the "term frequency" (TF) of the word. The definition of 
term frequency may optionally be modified to weight 
different portions of the text unequally: for example, any 
occurrence of a word in the text's title might be counted as 
a 3-fold or more generally k-fold occurrence (as if the title 
had been repeated k times within the text), in order to reflect 
a heuristic assumption that the words in the title are 
particularly important indicators of the text's content or topic. 
However, for lengthy textual attributes, such as the text of 
an entire document, the score of a word is typically defined 
to be not merely its term frequency, but its term frequency 
multiplied by the negated logarithm of the word's "global 
frequency," as measured with respect to the textual attribute 
in question. The global frequency of a word, which 
effectively measures the word's uninformativeness, is a fraction 
between 0 and 1, defined to be the fraction of all target 
objects for which the textual attribute in question contains 
this word. This adjusted score is often known in the art as 
TF/IDF ("term frequency times inverse document 
frequency"). When global frequency of a word is taken into 
account in this way, the common, uninformative words have 
scores comparatively close to zero, no matter how often or 
rarely they appear in the text. Thus, their rate has little 
influence on the object's target profile. Alternative methods 
of calculating word scores include latent semantic indexing 
or probabilistic models. 

Instead of breaking the text into its component words, one 
could alternatively break the text into overlapping word 
bigrams (sequences of 2 adjacent words), or more generally, 
word n-grams. These word n-grams may be scored in the 
same way as individual words. Another possibility is to use 
character n-grams. For example, this sentence contains a 
sequence of overlapping character 5-grams which starts "for 
e", "or ex", "r exa", " exam", "examp", etc. The sentence 
may be characterized, imprecisely but usefully, by the score 
of each possible character 5-gram ("aaaaa", "aaaab", . . . 
"zzzzz") in the sentence. Conceptually speaking, in the 
character 5-gram case, the textual attribute would be 
decomposed into at least 26^5 = 11,881,376 numeric attributes. Of 
course, for a given target object, most of these numeric 
attributes have values of 0, since most 5-grams do not appear 
in the target object attributes. These zero values need not be 
stored anywhere. For purposes of digital storage, the value 
of a textual attribute could be characterized by storing the set 
of character 5-grams that actually do appear in the text, 
together with the nonzero score of each one. Any 5-gram 
that is not included in the set can be assumed to have a score 
of zero. The decomposition of textual attributes is not 
limited to attributes whose values are expected to be long 
texts. A simple, one-term textual attribute can be replaced by 
a collection of numeric attributes in exactly the same way. 
Consider again the case where the target objects are movies. 
The "name of director" attribute, which is textual, can be 
replaced by numeric attributes giving the scores for 
"Federico-Fellini," "Woody-Allen," "Terence-Davies," and 
so forth, in that attribute. For these one-term textual 
attributes, the score of a word is usually defined to be its rate 
in the text, without any consideration of global frequency. 
Note that under these conditions, one of the scores is 1, 
while the other scores are 0 and need not be stored. For 
example, if Davies did direct the film, then it is 
"Terence-Davies" whose score is 1, since "Terence-Davies" 
constitutes 100% of the words in the textual value of the "name of 
director" attribute. It might seem that nothing has been 
gained over simply regarding the textual attribute as having 
the string value "Terence-Davies." However, the trick of 
decomposing every non-numeric attribute into a collection 
of numeric attributes proves useful for the clustering and 
decision tree methods described later, which require the 
attribute values of different objects to be averaged and/or 
ordinally ranked. Only numeric attributes can be averaged or 
ranked in this way. Just as a textual attribute may be 
decomposed into a number of component terms (letter or 
word n-grams), an associative attribute may be decomposed 
into a number of component associations. For instance, in a 
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